made disasters more unpredictable than before. In addition, responses are 
often late due to manual processes that have to be performed by experts. 
Consequently, major advances in computer vision (CV) have prompted 
researchers to develop smart models to help these experts. Such models need a 
strong image representation, yet the deep learning environment must also 
remain low-cost. This research develops transfer learning models using 
low-cost masking pre-processing on the experimental building damage (xBD) 
dataset, a large-scale dataset for advancing building damage assessment. The 
dataset includes eight types of disasters located in fifteen different 
countries and spans thousands of square kilometers of satellite images. The 
models are based on U-Net with AlexNet, visual geometry group (VGG)-16, and 
ResNet-34 as convolutional bases. Our experiments show that ResNet-34 is the 
best, with an F1 score of 71.93% and an intersection over union (IoU) of 
66.72%. The models are trained at a resolution of 1,024 pixels and, unlike 
the state-of-the-art baseline, use only the first-tier images. For future 
directions, we believe that the proposed approach could help improve the 
efficiency of deep learning training. 
1. INTRODUCTION 

A considerable number of unprecedented weather changes around the world have made disasters more 
unpredictable and more severe than before [1]. Meanwhile, advances in machine learning (ML) and 
computer vision (CV) have made it possible to build intelligent and independent solutions for disaster 
prevention around the world. Additionally, the increasing availability of satellite images from United States 
and European scientific agencies, such as the United States Geological Survey (USGS), the National Oceanic 
and Atmospheric Administration (NOAA), and the European Space Agency (ESA), has encouraged further 
research on ML and CV with the help of domain experts, such as humanitarian assistance and disaster 
recovery (HADR) and remote sensing experts [2]-[4]. Training accurate and robust CV models requires 
large-scale and varied datasets; moreover, building designs differ from one another depending on the 
location or country in which the buildings stand. Recognizing all types of buildings from various places is 
therefore a challenge for CV models. 
The experimental building damage (xBD) dataset [2] comprises satellite images used for 
detecting building shapes and assessing building damage. The dataset encompasses eight types 
of disasters located in fifteen different countries and covers thousands of square kilometers of satellite 
imagery. It consists of pairs of images: the first and second images of a pair show a region before and 
after a disaster, respectively. The dataset has also been annotated in JavaScript Object Notation (JSON) 
form, so no further annotation is needed. This research attempts to build CV models capable of detecting 
and segmenting building shapes on satellite images taken before and after disasters occur. 

One of the important issues in image processing is the complexity of the feature extraction 
process. We need a powerful image representation model, but we also need a low-cost deep learning 
environment. Our main research question is thus how to prepare a simple yet powerful image 
preprocessing step for transfer learning. 

The transfer learning approach was chosen for this research because it is the established practice 
behind state-of-the-art models [5]-[7]. In particular, the models trained to detect building shapes from 
given images employ convolutional neural network (CNN) architectures such as AlexNet [8], visual 
geometry group (VGG) [9], and ResNet [10]. Furthermore, we postulate that a low-complexity 
pre-processing algorithm makes the entire transfer learning process more efficient. 


2. METHOD 
2.1. State-of-the-art techniques 

Image segmentation refers to partitioning an image into different areas, with each area commonly 
representing a class. Specifically, CV techniques can be employed on satellite images to extract a partition 
of the image as an object of a predefined class. Techniques for satellite image segmentation include 
thresholding, clustering, region-based methods, and artificial neural networks (ANN). Among these, ANN 
gives the best accuracy [11]. 

CNN is one of the deep learning techniques used for CV tasks. Specifically, CNN is developed from 
the multilayer perceptron (MLP) to process two-dimensional data such as images [7], [12], [13]. A CNN has 
three types of layers, divided into two main parts: a feature learning part and a classifier part. The feature 
learning part consists of convolution layers and pooling layers. The classifier part comprises a fully 
connected layer. Different arrangements of these layers yield various CNN architectures such as 
AlexNet [8], VGG [9], and ResNet [10]. 

U-Net can process large images and generates outputs of the same size as its inputs. Another 
advantage of U-Net is its processing speed, which is constant during the training phase. U-Net adopts the 
CNN training method but, in its expansive part, replaces pooling operations with upsampling operations so 
that the feature maps reduced by the convolutional and pooling layers are restored to the size of the input 
image [14]. The U-Net architecture resembles the letter U and is divided into a contracting part and an 
expansive part. The contracting part handles feature extraction, while the expansive part transfers features 
and reconstructs the image at the original input size. 
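
To illustrate why the output can match the input size, the following minimal sketch traces the spatial size of the feature maps through a U-shaped network. It assumes padded convolutions (so that only pooling and upsampling change the size), 2x2 pooling, 2x upsampling, and a depth of four levels. The function name and these settings are illustrative assumptions, not details reported in [14] or in this paper.

```python
def unet_feature_sizes(input_size, depth=4):
    """Trace feature-map sizes through a U-shaped network.

    Assumes padded convolutions, so only 2x2 pooling (contracting part)
    and 2x upsampling (expansive part) change the spatial size.
    """
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")
    # Contracting part: each pooling step halves the size.
    contracting = [input_size // (2 ** level) for level in range(depth + 1)]
    # Expansive part: each upsampling step doubles it again.
    expansive = contracting[-2::-1]
    return contracting, expansive


down, up = unet_feature_sizes(1024)
print(down)  # [1024, 512, 256, 128, 64]
print(up)    # [128, 256, 512, 1024]
```

Under these assumptions the last expansive level has the same size as the input, which is what allows a per-pixel label to be predicted.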

Satellite image datasets before xBD cover only one type of natural disaster, with varying label 
criteria for damaged buildings [4], [15], [16]. Furthermore, the datasets in [17] and [18] provide locations of 
disaster occurrences; however, they do not include images of damaged building structures. There are 
also datasets with multi-view imagery, such as change detection and land classification [19]-[21], where 
several visits to one site and a time series of satellite images are provided. Prominent satellite image 
segmentation techniques, which are unsupervised, have been applied to road segmentation [22], [23]; 
however, little literature discusses road segmentation and identification with obstructions. Other 
segmentation approaches to detect damaged buildings propose an ML model trained on non-building 
shapes [24]. 

Ronneberger et al. [14] develop a U-Net architecture specifically designed to segment objects in 
medical images with a limited amount of training data. They employ both the Glioblastoma-astrocytoma 
U373 cells on a polyacrylamide substrate (PhC-U373) and the Henrietta Lacks cells on a flat glass recorded 
by differential interference contrast microscopy (DIC-HeLa) datasets to measure the model's intersection 
over union (IoU) value. The IoU values for the PhC-U373 and DIC-HeLa datasets are 0.9203 and 0.7756, 
respectively. Gupta et al. [2] establish a baseline model for the xBD dataset. In particular, they use 
SpaceNet, a variant of the U-Net architecture shown in Figure 1. The IoU values of their model for ground 
and building are 0.97 and 0.66, respectively. Kurama et al. [11] use a U-Net architecture trained on 2,000 
images of the Defence Science and Technology Laboratory (DSTL) dataset and achieve 98% accuracy. 
Figure 1. U-Net architecture [10] 


2.2. Contributions 
This research contributes to the recent CV literature in the following aspects: 
i) We experiment with a lightweight masking preprocessing procedure for the disaster images in the 
xBD dataset, which gives low-complexity yet powerful feature extraction in the U-Net architectures. 
ii) We compare several variants of CNN U-Net architectures for detecting building shapes before 
and after disasters in the xBD dataset. The CNN architectures analyzed in this research are AlexNet, 
VGG-16, and ResNet-34, as these are the most widely used in the literature [5]. 
We believe that this research gives some insight into the masking preprocessing procedure and 
its potential during transfer learning. As far as we know, our research is the first to compare the original 
experiment on the xBD dataset across various U-Net architectures. 


2.3. Experiments 
2.3.1. Dataset 

This research uses the xBD dataset, one of the publicly available datasets of annotated 
high-resolution satellite images. The dataset has more than 850,000 polygons for 22,000 building images 
from six types of disasters worldwide, which encompass more than 45,000 square kilometers [2]. The 
annotations were made by experts in their fields, such as the California Air National Guard (CAL FIRE) 
and the Federal Emergency Management Agency (FEMA). Each satellite image has three red, green, blue 
(RGB) channels of 1,024 by 1,024 pixels. In this research, the first tier of the dataset is used, divided by 
xView2 into two portions, a train set and a validation set. The train and validation sets contain 5,598 and 
1,866 images, respectively, covering the types of disasters described in Table 1. 


Table 1. Number of images for each disaster 
                        Number of images 
Disaster                Train    Validation 
guatemala-volcano          36            10 
hurricane-florence        638           238 
hurricane-harvey          638           190 
hurricane-matthew         476           188 
hurricane-michael         686           218 
mexico-earthquake         242            68 
midwest-flooding          558           172 
palu-tsunami              226            82 
santa-rosa-wildfire       452           154 
socal-fire              1,646           546 
Total                   5,598         1,866 


2.3.2. Image preprocessing 
The xBD dataset annotations are saved in JSON format, and one of the annotations is the 
coordinates of the buildings in an image. This coordinate information is preprocessed to create 
a masking image [25]. The masking image consists of two classes, ground and building: a zero-valued 
pixel refers to ground, whereas a one-valued pixel indicates a building. 
Figures 2 and 3 show an image before and after the masking process is applied. The masking 
image is used as the label, or target, during the training of a CV model. 


Figure 2. An image before masking 
Figure 3. An image after masking is applied 


2.3.3. Model training 
A model (f) is trained on satellite images to detect buildings at the pixel level. The masks used as 
training labels are produced by the preprocessing shown in Algorithm 1: 


Algorithm 1 Preprocessing images algorithm 
procedure Preprocessing(images, json_file) 
    read the json_file containing building coordinates 
    for each image in images do 
        for each pixel (i, j) in the image do 
            if (i, j) is part of a building then  # utilize the JSON file 
                (i, j) = 1 
            else 
                (i, j) = 0 
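
A minimal Python sketch of Algorithm 1 follows. It assumes that each building footprint has already been read from the JSON file into a list of (x, y) vertices; this layout and the function names are hypothetical and do not reflect the actual xBD annotation schema. Whether a pixel belongs to a building is decided with a standard ray-casting test on the pixel centre, which is our choice for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if the point (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_mask(height, width, building_polygons):
    """Return a height x width mask: 1 for building pixels, 0 for ground."""
    mask = [[0] * width for _ in range(height)]
    for i in range(height):        # row index (y direction)
        for j in range(width):     # column index (x direction)
            # Test the pixel centre against every building footprint.
            if any(point_in_polygon(j + 0.5, i + 0.5, polygon)
                   for polygon in building_polygons):
                mask[i][j] = 1
    return mask


# Hypothetical footprint: a square spanning x and y from 1 to 3.
mask = build_mask(4, 4, [[(1, 1), (3, 1), (3, 3), (1, 3)]])
for row in mask:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 1, 0]
# [0, 1, 1, 0]
# [0, 0, 0, 0]
```

The per-pixel loop mirrors Algorithm 1 but is slow for 1,024 by 1,024 images; in practice a polygon rasterizer from an imaging library would likely be preferable.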


The model assigns a class to every pixel $p_{ij}$ of an image, where $(i, j)$ is the coordinate of the 
pixel. This task is well known as image segmentation in the CV literature [26]. We choose the transfer 
learning approach because it gives the best performance, as elaborated by Raffel et al. [27]. The 
convolutional base of the CNN has been trained on the ImageNet dataset [5]; therefore, the xBD images are 
normalized with the ImageNet statistics so that they have the same input distribution [28]. The transfer 
learning approach is illustrated in Figure 4. 
Figure 4. Transfer learning approach illustration 
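
The normalization by ImageNet statistics mentioned above can be sketched as follows. The per-channel mean and standard deviation are the values commonly used for ImageNet-pretrained models on images scaled to [0, 1]; the paper does not list the exact values it used, so treat them, and the function name, as assumptions.

```python
# Commonly used ImageNet channel statistics (R, G, B) for inputs in [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - mean) / std
                 for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# A mid-grey pixel ends up close to zero in every channel.
print(normalize_pixel((128, 128, 128)))
```

Applying the same transform to every pixel of every xBD image gives the pretrained convolutional base inputs in the range it saw during ImageNet training.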
The transfer learning approach uses a convolutional base learner that has already learned many 
features from a dataset for a specific task. This knowledge is then used to perform the task on a 
different dataset without initializing the weights randomly. If the dataset is quite large, all the weights of 
the model can be updated; this training process is commonly called fine-tuning. Accordingly, our model 
undergoes a two-stage training process. First, only the head of the model is trained on the dataset. Next, 
the model is trained to update the weights of all layers [29]. 
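
The two-stage schedule can be sketched as a per-layer-group learning rate plan. The head learning rate (0.0003) and the fine-tuning range (0.000001 to 0.0001) are the values reported in this paper; the number of layer groups and the geometric spacing of the fine-tuning rates across groups are assumptions made for illustration, as is the function name.

```python
def stage_learning_rates(stage, n_groups=3, head_lr=3e-4,
                         lr_min=1e-6, lr_max=1e-4):
    """Learning rate per layer group, ordered from earliest layers to head.

    A rate of 0.0 marks a frozen group. The group count and the geometric
    spacing are illustrative assumptions.
    """
    if stage == "head":
        # Stage 1: pretrained groups frozen, only the head is updated.
        return [0.0] * (n_groups - 1) + [head_lr]
    if stage == "fine-tune":
        # Stage 2: all groups updated, smaller rates for earlier layers.
        if n_groups == 1:
            return [lr_max]
        ratio = (lr_max / lr_min) ** (1 / (n_groups - 1))
        return [lr_min * ratio ** k for k in range(n_groups)]
    raise ValueError(f"unknown stage: {stage!r}")


print(stage_learning_rates("head"))
print(stage_learning_rates("fine-tune"))
```

Giving the earliest layers the smallest rate keeps the general features learned on ImageNet largely intact while the later layers adapt to satellite imagery.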

The deep learning library used during training is fast.ai, run on an n1-highmem-4 machine with an 
NVIDIA Tesla T4 graphics processing unit (GPU) on Google Cloud Platform for 4 days. The learning rate is 
0.0003, obtained from the cyclical learning rate finder algorithm [30]. During training, data augmentation 
techniques such as flipping images horizontally, rotating, magnifying, adjusting brightness and contrast, 
and warping images are also used. The performance metrics for this task are precision, recall, and F1, 
given in (1)-(3), computed from the true positive (TP), false positive (FP), and false negative (FN) 
counts. 


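$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$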


Additionally, the IoU metric in (4), which is the metric used by Gupta et al. [2], is also used to evaluate our model. 

$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}} \tag{4}$$
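
As a worked illustration of (1)-(4), the sketch below computes the four metrics for the building class from a flattened ground-truth mask and a flattened prediction. For binary masks the area of overlap equals TP and the area of union equals TP + FP + FN. Returning 0.0 when a denominator is zero is an assumed convention, and the function name is illustrative.

```python
def segmentation_metrics(y_true, y_pred):
    """Precision, recall, F1 and IoU of the building class (label 1).

    y_true and y_pred are flat sequences of 0 (ground) or 1 (building).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Overlap = TP, union = TP + FP + FN for binary masks.
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
# Here TP = 2, FP = 1, FN = 1: precision, recall and F1 are 2/3, IoU is 0.5.
print(segmentation_metrics(truth, pred))
```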


3. RESULTS AND DISCUSSION 

Three CNN-based architectures, i.e., AlexNet, VGG-16, and ResNet-34, are trained on 512 by 
512-pixel images for 10 epochs. Our best-performing model is chosen based on the F1 score because of the 
imbalance between ground and building instances in our dataset. The comparison of the three models 
when only the heads are trained is displayed in Table 2. 
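
The reason for preferring F1 over accuracy under class imbalance can be seen in a small hypothetical example: when only 5% of the pixels are buildings, a predictor that labels every pixel as ground is 95% accurate yet finds no building at all. The data below are made up for illustration, and F1 is computed as 2TP / (2TP + FP + FN), which is algebraically equivalent to (3).

```python
def accuracy_and_f1(y_true, y_pred):
    """Pixel accuracy and building-class F1 for flat 0/1 label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical image: 5 building pixels and 95 ground pixels.
truth = [1] * 5 + [0] * 95
all_ground = [0] * 100
print(accuracy_and_f1(truth, all_ground))  # (0.95, 0.0)
```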

The best of the three models, ResNet-34, is then trained at 512 and 1,024 pixels on the head only 
for 40 epochs with a learning rate of 0.0003. Next, all layers are fine-tuned with learning rates ranging 
from 0.000001 to 0.0001. The results of the training process are given in Tables 3 and 4. Both tables show 
that the models give better F1 scores and IoU results than those in Table 2. 


Table 2. Comparison of the three models at the tenth epoch 
Model        Accuracy    Precision    Recall    F1 score 
AlexNet      0.950       0.640        0.271     0.357 
VGG-16       0.958       0.696        0.391     0.474 
ResNet-34    0.966       0.700        0.674     0.683 


Table 3. Training ResNet-34 model at 512 pixels resolution 
Train          Accuracy    Precision    Recall    F1 score    Mean IoU building 
Head           0.974       0.803        0.708     0.751       0.592 
Fine-tuning    0.975       0.804        0.720     0.758       0.609 


Table 4. Training ResNet-34 model at 1,024 pixels resolution 
Train          Accuracy    Precision    Recall    F1 score    Mean IoU building 
Head           0.978       0.789        0.681     0.719       0.667 
Fine-tuning    0.978       0.791        0.676     0.717       0.669 


Figure 5 presents a sample of our ground truth pixel values, while Figure 6 presents the corresponding 
predictions. The performance of the trained model on the validation set is measured by IoU [14], specifically 
the building IoU. Table 5 (512 pixels) and Table 6 (1,024 pixels) give the segmentation IoU values on the 
validation set for the ten disasters. The segmentation of hurricane-matthew gives the lowest building IoU, 
while that of guatemala-volcano is surprisingly good considering that its dataset is the smallest. 


Figure 5. The ground truth pixel values of one sample in the validation set. The image size is 
1,024x1,024 pixels (in the x and y-axis directions) 
Figure 6. The predicted pixel values of the sample. The image size is 1,024x1,024 pixels (in the x and 
y-axis directions) 


Table 5. IoU of disasters at 512 pixels resolution 
IoU segmentation at 512 pixels per disaster 
                       Training head               Fine-tuning 
Disaster               IoU ground   IoU building   IoU ground   IoU building 
guatemala-volcano      0.992716     0.516159       0.992850     0.528130 
hurricane-florence     0.996835     0.651713       0.996637     0.666267 
hurricane-harvey       0.976307     0.672333       0.975674     0.688640 
hurricane-matthew      0.993617     0.276589       0.993091     0.314112 
hurricane-michael      0.986097     0.675072       0.985711     0.689483 
mexico-earthquake      0.905966     0.671344       0.902866     0.687535 
midwest-flooding       0.994258     0.640343       0.994310     0.656130 
palu-tsunami           0.953890     0.700680       0.947037     0.729558 
santa-rosa-wildfire    0.986534     0.623966       0.986657     0.638125 
socal-fire             0.996651     0.532794       0.996702     0.541918 


Table 6. IoU of disasters at 1,024 pixels resolution 
IoU segmentation at 1,024 pixels per disaster 
                       Training head               Fine-tuning 
Disaster               IoU ground   IoU building   IoU ground   IoU building 
guatemala-volcano      0.995799     0.582504       0.995598     0.577696 
hurricane-florence     0.997853     0.744014       0.997796     0.749505 
hurricane-harvey       0.978948     0.734031       0.979413     0.731891 
hurricane-matthew      0.994308     0.364812       0.994263     0.375385 
hurricane-michael      0.988052     0.742830       0.987936     0.742655 
mexico-earthquake      0.914349     0.705674       0.916219     0.700831 
midwest-flooding       0.996147     0.726253       0.996176     0.726788 
palu-tsunami           0.957746     0.742502       0.958971     0.744839 
santa-rosa-wildfire    0.989383     0.708836       0.989252     0.700055 
socal-fire             0.997107     0.611816       0.997081     0.614974 


4. CONCLUSION 

This research examines satellite image segmentation using a U-Net architecture with convolutional 
bases such as AlexNet, VGG-16, and ResNet-34. The final model is ResNet-34, with an accuracy of 
0.978409, a precision of 0.789098, a recall of 0.681466, and an F1-score of 0.719300 when the head of the 
model is trained. The mean IoU is 0.667237, which is similar to the IoU of the baseline reported in 
the initial xBD dataset exploration. However, our research uses a smaller dataset than the baseline, namely 
only the first tier. Moreover, our architecture, ResNet-34, is simpler than that of the baseline. We also 
trained the model in 4 days, compared with 7 days for the baseline. These advantages are achieved because 
of the transfer learning approach. For future directions, we believe that our proposed method can help 
improve training efficiency in deep learning. We strongly recommend cooperating with satellite image 
experts to obtain in-depth interpretation and information. Furthermore, a greater number of images should 
also improve performance at detecting buildings from satellite images. Subsequently, the models can be 
extended to detect levels of damage to buildings after successful segmentation. 